Take-home Exercise 3

Create a data visualisation to segment kid drinks and other by nutrition indicators.

Sun Shengmei https://www.linkedin.com/in/shengmei-sun-9b262656/?originalSubdomain=sg (SMU MITB)https://scis.smu.edu.sg/master-it-business
2022-02-20

1.0 Introduction

In this task we are going to use heatmap and parallel coordinate plot to segment kid drinks and other by nutrition indicators.

Heatmaps visualise data through variations in coloring. Heatmaps are useful for cross-examining multivariate data, through placing variables in the columns and observation (or records) in rows and coloring the cells within the table. Heatmaps are good for showing variance across multiple variables, revealing patterns, displaying whether any variable are similar to others, and for detecting if any correlation exists.

Parallel coordinates plot is a data visualization specially designed for visualizing and analysing multivariate, numerical data. It is ideal for comparing multiple variables together and seeing the relationships between them. Parallel coordinates plot can be used to characterize clusters detected during segmentation.

For the purpose of this task, starbucks_drink.csv will be used.

2.0 Data and Considerations

3.0 Step-by-step Preparations

3.1 Installing and loading the required libraries

The code chunk below is to install the packages needed.

packages = c('seriation', 'dendextend', 'heatmaply', 'GGally',  'parallelPlot', 'tidyverse')

for(p in packages){library
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

3.2 Data Import

The code chunk below imports starbucks_drink.csv into R environment by using read_csv() function of readr package.

df <- read_csv("data/starbucks_drink.csv") 

3.3 Data Wrangling

For multivariate visualization, values are usually scaled between 0 to 1 instead of using the absolute values. Therefore, we need to change the variables name and remove the unit to avoid the confusion.

df <- df %>%
  rename_with(~str_replace(., "\\([a-z]+\\)", ""))

We noticed that the caffeine content column is in character. Below code is to convert the characters to be numbers.

df$`Caffeine` = as.numeric(df$`Caffeine`)

let’s explore the drinks size.

size <- df %>%
  group_by(`Size`) %>%
  summarise(mean = mean(`Portion(fl oz)`), count = n()) %>%
  ungroup()

size
# A tibble: 8 x 3
  Size         mean count
  <chr>       <dbl> <int>
1 Grande      15.9    583
2 Kids         8       34
3 Short        8.35   206
4 Tall        12.1    583
5 Trenta Iced 30       73
6 Venti       19.9    218
7 Venti Iced  23.9    359
8 <NA>        10.6     12

As seen above, there are 7 named types of size. For large number of data, we will use “Grande” size in this study.

df <- df %>%
   filter(Size == 'Grande')

Next, we need to group the data by category, since there are many different drinks for each category. Mean() is used to summarize all the data.

df1 <- df[, c(1,4:15)] %>%
  group_by(Category) %>%
  summarise_all("mean") %>%
  ungroup()
df1
# A tibble: 8 x 13
  Category       Calories `Calories from ~ `Total Fat` `Saturated fat`
  <chr>             <dbl>            <dbl>       <dbl>           <dbl>
1 brewed-coffee      76.8            31.1        3.43            1.96 
2 espresso          233.             67.3        7.46            4.27 
3 frappuccino-b~    327.             72.2        7.96            4.85 
4 iced-coffee        55.2            11.3        1.19            0.654
5 iced-tea           82.3            10.8        1.23            0.769
6 kids-drinks-a~    328              96.7       10.6             5.96 
7 refreshers        112.              8.33       0.917           0.833
8 tea               139.             23.1        2.55            1.24 
# ... with 8 more variables: Trans fat <dbl>, Cholesterol <dbl>,
#   Sodium <dbl>, Total Carbohydrate <dbl>, Dietary Fiber <dbl>,
#   Sugars <dbl>, Protein <dbl>, Caffeine <dbl>

For the heatmaps, we need to change the rows index to category name from row number by using the code chunk below.

row.names(df1) <- df1$Category

Next, we need to convert the data into matrix format.

df1_matrix <- data.matrix(df1)

Now, our 1st cleaned data set df1_matrix is ready for plotting. Next, let’s filter out the data for kid-drinks-and-others category to look into more details.

kids_drinks <- df %>% 
  filter(Category == 'kids-drinks-and-other')

Below code is to create the unique name for each product.

kids_drinks <-  kids_drinks %>%
  mutate(Drinks =  paste(`Name`,`Milk`,`Whipped Cream`))

Next, let’s change the rows index to “Drinks” from row number by using the code chunk below. Then, we change the data into matrix.

row.names(kids_drinks) <- kids_drinks$Drinks
Kids_drinks_matrix <- data.matrix(kids_drinks)

Now, our Kids_drinks_matrix is ready for plotting.

3.4 Comparison bwtween kid-drinks-and-others category and other categories

3.4.1 Interactive Heatmap - heatmaply()

When analysing multivariate data set, it is very common that the variables in the data sets includes values that reflect different types of measurement. In general, these variables’ values have their own range. In order to ensure that all the variables have comparable values, data transformation are commonly used before clustering.

Three main data transformation methods are supported by heatmaply(), namely: scale, normalise and percentilse. For the purpose of this study, percentilse method is chosen given the small observation size and the benefit of being easy to interpret .

In order to determine the best clustering method and number of cluster the dend_expend() and find_k() functions of dendextend package will be used.

First, the dend_expend() will be used to determine the recommended clustering method to be used.

df1_d <- dist(percentize(df1_matrix[, -c(1)]), method = "euclidean")
dend_expend(df1_d)[[3]]
  dist_methods hclust_methods     optim
1      unknown         ward.D 0.7782441
2      unknown        ward.D2 0.7812807
3      unknown         single 0.5482863
4      unknown       complete 0.7821805
5      unknown        average 0.7842891
6      unknown       mcquitty 0.7842107
7      unknown         median 0.7694487
8      unknown       centroid 0.7693485

The output table shows that “average” method should be used because it gave the high optimum value.

Next, find_k() is used to determine the optimal number of cluster.

df1_clust <- hclust(df1_d, method = "average")
num_k <- find_k(df1_clust)
plot(num_k)

Figure above shows that k=2 would be good.

With reference to the statistical analysis results, we prepare the code chunk as shown below.

heatmaply(percentize(df1_matrix[, -c(1)]),
          dist_method = "euclidean",
          hclust_method = "average",
          k_row = 2,
          Colv=NA,
          seriate ="OLO",
          colors = Blues,
          margins = c(NA,200,60,NA),
          fontsize_row = 8,
          fontsize_col = 8,
          xlab="Nutrition Indicators", 
          ylab="Drinks Category", 
          main="Starbucks Drinks Nutrition Indicators by Category \nDataTransformation using Percentising Method")

3.4.2 Static Parallel Coordinates Plot - ggparcoord()

In this section, we will explore the dataset using ggparcoord() of GGally package. ggparcoord() function doesn’t need matrix format of data frame. We can directly use the cleaned df dataset.

ggparcoord(data = df, 
           columns = c(4:15),
           groupColumn = 1,
           scale = "uniminmax",
           alphaLines = 0.2,
           boxplot = TRUE, 
           title = "Multiple Parallel Coordinates Plot of Starbucks Drinks Nutrition Indicators by Category") +
  facet_wrap(~ Category) +
  theme(axis.text.x = element_text(angle = 45, hjust=1))

3.5 Drinks segmentation for kid-drinks-and-others category

3.5.1 Interactive Heatmap - heatmaply()

Using the same codes and steps as 3.4.1, below visualization of clusterings is created.

First, the dend_expend() will be used to determine the recommended clustering method to be used.

Kids_d <- dist(percentize(Kids_drinks_matrix[, c(4:15)]), method = "euclidean")
dend_expend(Kids_d)[[3]]
  dist_methods hclust_methods     optim
1      unknown         ward.D 0.4983142
2      unknown        ward.D2 0.5628215
3      unknown         single 0.4165639
4      unknown       complete 0.5801873
5      unknown        average 0.5924295
6      unknown       mcquitty 0.5954184
7      unknown         median 0.3323152
8      unknown       centroid 0.3885602

The output table shows that “mcquitty” method should be used because it gave the high optimum value.

Next, find_k() is used to determine the optimal number of cluster.

kids_clust <- hclust(Kids_d, method = "mcquitty")
num_k <- find_k(kids_clust)
plot(num_k)

Figure above shows that k=10 would be good. To avoid big number of clusters and too granular grouping, next best number of clusters 8 is chosen.

With reference to the statistical analysis results, we prepare the code chunk as shown below.

heatmaply(percentize(Kids_drinks_matrix[, c(4:15)]),
          dist_method = "euclidean",
          hclust_method = "mcquitty",
          k_row = 8,
          Colv=NA,
          seriate ="OLO",
          colors = Blues,
          margins = c(NA,200,60,NA),
          fontsize_row = 7,
          fontsize_col = 7,
          xlab="Nutrition Indicators", 
          ylab="Drinks Category", 
          main="Starbucks Kids-Drinks Nutrition Indicators by Product \nDataTransformation using Percentising Method")

3.5.2 Interactive Parallel Coordinates - parallelPlot()

parallelPlot is an R package specially designed to plot a parallel coordinates plot by using ‘htmlwidgets’ package and d3.js. In order to visualize the different clusters, we need to add in the cluster numbers in the dataset.

kids_drinks_clust <- kids_drinks[, c(19,4:15)] 

clust_num <- data.frame(cutree(kids_clust, k=8, order_clusters_as_data=FALSE)) %>%
  rownames_to_column(var = "Drinks") %>%
  rename(`Cluster No.` = 2)

We need then to join above cluster No. data set with kids_drinks_clust data set by column “Drinks”

kids_drinks_clust <- left_join(kids_drinks_clust, clust_num)

Next, we can plot the paralle plot.

histoVisibility <- rep(TRUE, ncol(kids_drinks_clust))

parallelPlot(kids_drinks_clust,
             continuousCS = "YlOrRd",
             rotateTitle = TRUE,
             histoVisibility = histoVisibility)

4.0 Results Intepretation

From section 3.4, we can see that kids-drinks-and-other are in the same cluster as cappuccino-blended-beverages and espresso. Kids drink have the highest calories, calories from fat, total fat, saturated fat, trans fat, dietary fiber, and protein among all the drink categories. The only nutrition that kids drinks have less is caffeine, compared to other drinks categories.

From section 3.5, we have below observations:

5.0 Comparison of Heatmaps and Parallel Coordinates Plot

From section 3.4, both heatmap and parallel plot gave the same result that kids drinks, cappuccino-blended-beverages and espresso belong to the same cluster, which have the higher nutrition indicator values, compared to the other cluster.

The advantage of heatmap over parallel plot is that it is easier to read. Each observation is clearly labelled and identifiable with unique name.

The disadvantage of heatmap compared to parallel plot is that the data preparation process is more tedious. Matrix format data is required. Each observation need to have the unique name.

Interactive parallel plot allows readers to interact with the data, to select and to highlight the data. However, with large number of observations, the plot becomes messy and difficult to understand.

Overall, heatmap is the best solution for this analysis.